class: center, middle, inverse, title-slide

# Introduction to Visualization: Overview, Tools, Best Practices, and Industry

### Josemari Feliciano
Data Scientist, US Dept of Agriculture

### Guest Lecture for BIS 633
Population and Public Health Informatics
October 26, 2022

---

<style type="text/css">
code.r{
  font-size: 10px;
}
pre {
  font-size: 10px
}
</style>

# Overview

1. Commonly-used visualization types and best practices (e.g., dislike for pie charts).

2. Tools in industry.

3. Forum to discuss any question you may have on visualization and anything data science (e.g., preparing for industry jobs).

---

# Commonly-used visualization types

The goal of this section is to give you a broad overview of commonly-used visualization types (e.g., bar charts, pie charts, maps).

--

This section is mainly self-explanatory. I will focus only on a few types that require or deserve contemporary commentary.

--

This course does not have a programming requirement. Nevertheless, I am providing optional visualization materials in R.

---

# Visualization types: Categorical Data

What is categorical data? A variable type with two or more categories.

Examples of patient information in a clinical trial that are categorical data:

- Two categories (Binary): Gender assigned at birth (Male, Female).

- Three or more categories: Race (White, Black and African American, Asian, Native American and Alaskan Native, Native Hawaiian and Other Pacific Islander, Mixed).

---

# Visualization types: Categorical Data

Clinical trial with two treatment groups:

<img src="data:image/png;base64,#Images/trial1.png" width="60%" style="display: block; margin: auto;" />

Clinical trial with three treatment groups:

<img src="data:image/png;base64,#Images/trial2.png" width="60%" style="display: block; margin: auto;" />

---

# Visualization types: Categorical Data

Two common ways to visualize categorical data: pie charts and bar charts.
<br>

Creating and Visualizing Data Using Pie Charts:

.pull-left[
```r
library(ggplot2)

# Counts of patients by IDH mutation status
data <- data.frame(group = c("IDH Mutation (+)", "IDH Mutation (-)"),
                   value = c(234, 1446))

# A pie chart in ggplot2 is a stacked bar drawn in polar coordinates
ggplot(data, aes(x="", y=value, fill=group)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() +
  guides(fill=guide_legend(title="Mutation Status"))
```
]

.pull-right[
<img src="data:image/png;base64,#VisualizationGuestLecture_files/figure-html/unnamed-chunk-5-1.png" width="80%" style="display: block; margin: auto;" />
]

---

# Issues with Pie Charts

Many people in industry strongly dislike pie charts.

__Reasons:__

--

- We are bad at judging angles.

--

- There are better visualization types.

--

- They become unclear or not useful when you have several categories.

---

# Issues with Pie Charts

Many people in industry strongly dislike pie charts.

.pull-left[
Code:

```r
library(ggplot2)

# Five generic groups with similar values
data <- data.frame(group = c("A", "B", "C", "D", "E"),
                   value = c(10, 12, 14, 20, 19))

ggplot(data, aes(x="", y=value, fill=group)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() +
  guides(fill=guide_legend(title="Generic Groupings"))
```
]

.pull-right[
Output:

<img src="data:image/png;base64,#VisualizationGuestLecture_files/figure-html/unnamed-chunk-7-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Visualization types: Categorical Data

Recreate the visualization using a bar chart instead.
.pull-left[
Code:

```r
library(ggplot2)

# Same data as the pie chart
data <- data.frame(group = c("A", "B", "C", "D", "E"),
                   value = c(10, 12, 14, 20, 19))

ggplot(data = data, aes(x=group, y=value)) +
  geom_bar(stat="identity", fill="steelblue") +
  # Print each value inside the top of its bar
  geom_text(aes(label=value), vjust=1.6, color="white", size=3.5) +
  theme_minimal()
```
]

.pull-right[
Output:

<img src="data:image/png;base64,#VisualizationGuestLecture_files/figure-html/unnamed-chunk-9-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Visualization types: Categorical Data

Comparison of pie chart vs bar chart

.pull-left[
<img src="data:image/png;base64,#VisualizationGuestLecture_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#VisualizationGuestLecture_files/figure-html/unnamed-chunk-11-1.png" width="90%" style="display: block; margin: auto;" />
]

__Which one do you think is better? Do you see now why many dislike pie charts?__

---

# Pie Charts

<img src="data:image/png;base64,#Images/Email.png" width="100%" style="display: block; margin: auto;" />

.pull-left[
<img src="data:image/png;base64,#Images/chart1.jpg" width="80%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#Images/chart2.jpg" width="80%" style="display: block; margin: auto;" />
]

<center> What is wrong with these images? Let's discuss. </center>

---

## Statistical and Distributional Charts

This is a broad term I use for charts that relay statistical or statistics-like information.

These include charts that show the distribution of a numerical value being measured (e.g., systolic and diastolic blood pressure).

---

## mtcars data: a primer

mtcars is a built-in data set in R that we will use for several of the forthcoming visualizations.

Below are the first ten rows of the mtcars dataset.
```r
head(mtcars, 10)
```

```
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
```

Full documentation of the data set can be found here: www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars

---

## Interpreting Box plots

<img src="data:image/png;base64,#Images/boxplot_explanation.png" width="100%" style="display: block; margin: auto;" />

Think about the downsides of box plots as a visualization, especially for readers who are not statistically literate.

---

## Side by side comparison of histogram with boxplot

<img src="data:image/png;base64,#Images/relationship_boxplot.png" width="100%" style="display: block; margin: auto;" />

---

## Creating Box Plots in R

```r
# One box per cylinder count, showing the distribution of mpg
boxplot(mpg~cyl, data=mtcars, main="Car Mileage Data",
        xlab="Number of Cylinders", ylab="Miles Per Gallon")
```

<img src="data:image/png;base64,#VisualizationGuestLecture_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" />

---

## Alternatives to box plots: violin plot

```r
library(ggplot2)

# mtcars_v2 is not defined on the slides; assumed here to be
# mtcars with cyl converted to a factor so each group gets its own violin
mtcars_v2 <- mtcars
mtcars_v2$cyl <- as.factor(mtcars_v2$cyl)

ggplot(mtcars_v2, aes(x=cyl, y=mpg, fill=cyl)) +
  geom_violin() +
  theme_minimal()
```

<img src="data:image/png;base64,#VisualizationGuestLecture_files/figure-html/unnamed-chunk-20-1.png" width="60%" style="display: block; margin: auto;" />

---

## Beware of 'distortions'

"Since the introduction of Policy X in 1955, traffic deaths in CT have decreased significantly."
__If you were a member of the general public, would you buy that argument? Why?__

<img src="data:image/png;base64,#Images/traffic.png" width="80%" style="display: block; margin: auto;" />

---

## Critiquing one visualization you may have seen

There is a poster presentation in the halls of LEPH (60 College) you may have seen. If you are a YSPH student, you have likely seen this visualization.

I created that visualization and eventually improved it for the subsequent paper.

As a group, let us critique the visualizations involved.

---

<img src="data:image/png;base64,#Images/poster.png" width="100%" style="display: block; margin: auto;" />

This is the first draft of that poster; the version in LEPH should be very similar.

---

## What I corrected in the paper

.pull-left[
Poster:

<img src="data:image/png;base64,#Images/corrected.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
What I actually published:

<img src="data:image/png;base64,#Images/paper1.png" width="100%" style="display: block; margin: auto;" />
]

Final Paper: Feliciano JT, Salmi L, Blotner C, Hayden A, Nduom EK, Kwan BM, Katz MS, Claus EB. Brain Tumor Discussions on Twitter (#BTSM): Social Network Analysis. J Med Internet Res. 2020 Oct 8;22(10):e22005. doi: 10.2196/22005. PMID: 33030435; PMCID: PMC7582142.

---

## Dashboard Tools

Many in the industry rely on interactive dashboards to track internal metrics and key performance indicators (KPIs).

You can create dashboard tools for free using both R (RMarkdown, Shiny, flexdashboard) and Python (Widgets, Voila, Dash by Plotly, Streamlit).

---

### Dashboards I've Created

Let us take a quick look at some of the dashboard work I've completed in R to give you an idea of what you can do with free dashboarding tools:

- __COVID-19 Twitter Analysis Dashboard.__ Kwon J, Grady C, Feliciano JT, Fodeh SJ. Defining facets of social distancing during the COVID-19 pandemic: Twitter analysis. J Biomed Inform. 2020 Nov;111:103601.
doi: 10.1016/j.jbi.2020.103601. Epub 2020 Oct 14. PMID: 33065264; PMCID: PMC7553881. Dashboard Link: https://samahfodehlab.github.io/

- __#HIVPrevention: An Infodemiology Study Dashboard.__ Burgess R, Feliciano JT, Lizbinski L, Ransome Y. Trends and Characteristics of #HIVPrevention Tweets Posted Between 2014 and 2019: Retrospective Infodemiology Study. JMIR Public Health Surveill. 2022 Aug 11;8(8):e35937. doi: 10.2196/35937. PMID: 35969453; PMCID: PMC9412898. Dashboard Link: https://neonseri.github.io/HIVPreventionDashboard

---

## Dashboard Tools

Although there are free dashboarding tools, many industries are slow to adopt them in their enterprise solutions.

Generally, there are two widely used dashboarding tools throughout the world:

- Tableau (a Salesforce product)

- Power BI (a Microsoft product)

---

## Market Share of Tableau vs Power BI

Yale has a fair share of international students, so I want to share this image.

<img src="data:image/png;base64,#Images/market.png" width="80%" style="display: block; margin: auto;" />

<small> __Image Citation:__ Tableau Software Vs Microsoft Power BI : In-Depth Comparison. Slintel. https://www.slintel.com/tech/business-intelligence-bi/tableausoftware-vs-microsoftpowerbi </small>

---

## My advice on learning these (not so free) tools

Take advantage of free trials.

- Both Power BI and Tableau have free trials. It is a big help in interviews if you can say you have at least used and explored these tools.

Check the industry and country you may want to work in to help you decide where to focus first.

Lots of free tutorials online!

---

## Geospatial Visualization

- If you want to increase your likelihood of employment, learn the basics of GIS.

- ESRI maintains a widely used tool called ArcGIS to help companies visualize geospatial data. ESRI offers free ArcGIS training, so take advantage of it.

- I also highly suggest learning how to visualize mapping data using R and Python.
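
As one possible starting point for the last bullet, here is a minimal sketch in R. It is not taken from the lecture or workshop materials: it assumes only ggplot2 and R's built-in `quakes` data set (earthquake locations near Fiji), and the name `quake_map` is purely illustrative. It draws point locations without a basemap; real projects usually add boundaries with a dedicated spatial package.

```r
library(ggplot2)

# quakes is a built-in R data set of 1000 seismic events near Fiji,
# with columns lat, long, depth, mag, and stations.
# (Assumption: chosen for illustration only; it is not used in the lecture.)
quake_map <- ggplot(quakes, aes(x = long, y = lat, color = mag)) +
  geom_point(alpha = 0.6, size = 1.5) +
  # Approximate map projection so longitude and latitude are not stretched
  coord_quickmap() +
  scale_color_viridis_c(name = "Magnitude") +
  labs(title = "Earthquakes near Fiji",
       x = "Longitude", y = "Latitude") +
  theme_minimal()

quake_map
```

Using `coord_quickmap()` instead of the default Cartesian coordinates keeps the aspect ratio roughly faithful to the ground, which is often the first thing that goes wrong when plotting raw longitude and latitude.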

---

## Geospatial Training

I taught a detailed geospatial workshop earlier this year to RLadies in St Louis, MO.

Here is the link to the detailed presentation: https://neonseri.github.io/SpatialWorkshopPresentation

It is difficult to get proper geospatial training, so let us skim through these materials so you are aware of the complexities of geospatial data and map making.

---

## My advice on how to advertise your knowledge

.pull-left[
Here's what I have in my CV.

My "Skills and Interests" section is either the first or second section of my CV for industry jobs.

Make it explicit what you know.
]

.pull-right[
<img src="data:image/png;base64,#Images/cvsample.png" width="100%" style="display: block; margin: auto;" />
]

---

## Before I open the Q&A Discussion

Thank you for listening to my guest lecture.

If you are learning R, I highly recommend this reference website: https://r-graph-gallery.com/

Please stay in touch:

* Twitter: @SeriFeliciano
* LinkedIn: www.linkedin.com/in/jmtfeliciano/
* Email: jfeliciano@aya.yale.edu